F-test (Variance Ratio Test)#
The two-sample F-test for variances answers a simple question:
Do these two groups have the same variability?
It is also the distribution behind the F-statistic you see in ANOVA and linear regression, but in this notebook we focus on the most direct version: testing whether two population variances are equal.
Learning goals#
Understand what the F-test is testing (and what it is not testing)
Build intuition for where the F distribution comes from (chi-square → ratio)
Implement the test with NumPy only (Monte Carlo p-values + critical values)
Learn how to interpret p-values, rejection regions, and a variance-ratio confidence interval
See common pitfalls (especially non-normality / outliers)
import math
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
How sample size affects the F distribution#
The degrees of freedom (df₁ = n₁−1, df₂ = n₂−1) control how much the variance estimates fluctuate.
Small samples (small df) → wide, heavy-tailed distribution
Large samples (large df) → concentrated near 1 (variance estimates stabilize)
Here are a few F pdf curves to make that shape change visible.
# This cell needs the analytic F pdf. A compact version is defined here so the
# cell runs on its own; the fully documented version appears in section 3.
def f_pdf(x, df1, df2):
    x = np.asarray(x, dtype=float)
    a, b = df1 / 2.0, df2 / 2.0
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_coeff = a * math.log(df1 / df2) - log_beta
    with np.errstate(divide='ignore'):
        log_pdf = log_coeff + (a - 1.0) * np.log(x) - (a + b) * np.log1p((df1 / df2) * x)
    return np.where(x > 0, np.exp(log_pdf), 0.0)

df_pairs = [(5, 5), (10, 10), (30, 30), (5, 30)]
x_grid = np.linspace(1e-4, 5.0, 900)
fig = go.Figure()
for d1, d2 in df_pairs:
    fig.add_trace(
        go.Scatter(
            x=x_grid,
            y=f_pdf(x_grid, d1, d2),
            mode='lines',
            name=f'df1={d1}, df2={d2}',
        )
    )
fig.add_vline(x=1.0, line_dash='dash', line_color='black')
fig.update_layout(
    title='How degrees of freedom shape the F distribution',
    xaxis_title='F',
    yaxis_title='density',
)
fig.show()
1) What problem does the F-test solve?#
Sometimes the mean is not the main issue. You might care more about spread:
Does a new manufacturing process reduce variability?
Is sensor A noisier than sensor B?
Are two groups equally consistent, even if their averages match?
A variance F-test is a hypothesis test about population variances (σ²), based on sample variances (s²).
Let’s start with a picture: two groups with the same center, but different spread.
n_a, n_b = 40, 60
a = rng.normal(loc=0.0, scale=1.0, size=n_a)
b = rng.normal(loc=0.0, scale=1.8, size=n_b)
fig = go.Figure()
fig.add_trace(go.Violin(y=a, name='Group A', box_visible=True, meanline_visible=True))
fig.add_trace(go.Violin(y=b, name='Group B', box_visible=True, meanline_visible=True))
fig.update_layout(
title='Same center, different spread',
yaxis_title='value',
)
fig.show()
2) The hypothesis, assumptions, and what the result means#
Hypotheses#
For two independent samples (group 1 and group 2):
H₀: σ₁² = σ₂² (equal population variances)
H₁ (two-sided): σ₁² ≠ σ₂²
H₁ (one-sided): σ₁² > σ₂² or σ₁² < σ₂²
Assumptions (important)#
The classic F-test for variances is derived under:
Independence within and across the two samples
Normality: each group is (approximately) normally distributed
This test is very sensitive to outliers and non-normal/heavy-tailed data. If normality is doubtful, consider more robust alternatives such as Levene, Brown–Forsythe, or Fligner–Killeen (section 8 demonstrates the failure mode and sketches a Brown–Forsythe check).
What does rejecting H₀ mean?#
Rejecting H₀ means: the observed difference in sample variances is unlikely if the true variances were equal (given the assumptions).
Failing to reject H₀ does not prove variances are equal; it usually means the data are not strong enough to detect a difference.
3) Where does the F distribution come from?#
For a Normal sample of size $n$, the sample variance is

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

A key result (for Normal data) is

$$\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}.$$

So if you take two independent samples (sizes n₁, n₂), then under H₀: σ₁² = σ₂², the ratio of sample variances

$$F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} = \frac{s_1^2}{s_2^2}$$

follows an F distribution with degrees of freedom:

df₁ = n₁ − 1 (numerator)

df₂ = n₂ − 1 (denominator)
Let’s verify these building blocks with simulation.
def f_rvs(df1: int, df2: int, size: int, rng: np.random.Generator) -> np.ndarray:
"""Sample from an F(df1, df2) distribution using the chi-square ratio definition."""
num = rng.chisquare(df1, size=size) / df1
den = rng.chisquare(df2, size=size) / df2
return num / den
def f_pdf(x: np.ndarray, df1: int, df2: int) -> np.ndarray:
"""Analytic F(df1, df2) pdf using only stdlib + NumPy (no SciPy)."""
x = np.asarray(x, dtype=float)
a = df1 / 2.0
b = df2 / 2.0
log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
log_coeff = a * math.log(df1 / df2) - log_beta
with np.errstate(divide='ignore'):
log_pdf = (
log_coeff
+ (a - 1.0) * np.log(x)
- (a + b) * np.log1p((df1 / df2) * x)
)
pdf = np.exp(log_pdf)
return np.where(x > 0, pdf, 0.0)
# 1) Scaled sample variance ~ chi-square (Normal data)
n = 12
sigma = 2.0
n_sim = 50_000
samples = rng.normal(loc=0.0, scale=sigma, size=(n_sim, n))
s2 = samples.var(axis=1, ddof=1)
scaled_s2 = (n - 1) * s2 / (sigma**2)
chi = rng.chisquare(df=n - 1, size=n_sim)
fig = go.Figure()
fig.add_trace(
go.Histogram(
x=scaled_s2,
nbinsx=90,
histnorm='probability density',
name='(n-1)s^2/σ^2 from samples',
opacity=0.6,
)
)
fig.add_trace(
go.Histogram(
x=chi,
nbinsx=90,
histnorm='probability density',
name='Chi-square(df=n-1)',
opacity=0.6,
)
)
fig.update_layout(
title='Scaled sample variance follows a chi-square distribution (Normal data)',
barmode='overlay',
xaxis_title='value',
yaxis_title='density',
)
fig.show()
# 2) Ratio of (chi-square/df) terms ~ F
n1, n2 = 10, 25
df1, df2 = n1 - 1, n2 - 1
f_sim = f_rvs(df1, df2, size=n_sim, rng=rng)
x_max = np.quantile(f_sim, 0.995)
x_grid = np.linspace(1e-4, x_max, 700)
fig = go.Figure()
fig.add_trace(
go.Histogram(
x=f_sim,
nbinsx=120,
histnorm='probability density',
name='Simulated F',
opacity=0.6,
)
)
fig.add_trace(go.Scatter(x=x_grid, y=f_pdf(x_grid, df1, df2), mode='lines', name='Analytic pdf'))
fig.update_layout(
title=f'F distribution from a chi-square ratio (df1={df1}, df2={df2})',
barmode='overlay',
xaxis_title='F value',
yaxis_title='density',
)
fig.show()
4) The variance F-test: statistic, p-value, and critical region#
Given samples x and y:
Compute the unbiased sample variances s₁² and s₂² (using ddof=1).

Test statistic:

$$F_{\text{obs}} = \frac{s_1^2}{s_2^2}$$

Degrees of freedom: df₁ = n₁ − 1, df₂ = n₂ − 1
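To make the recipe concrete before wrapping it in a full function, here is a toy computation (the x_demo / y_demo samples are made up for illustration and use their own generator, so they don't disturb the shared rng):

# Toy illustration of the recipe: variance ratio plus degrees of freedom.
rng_demo = np.random.default_rng(5)
x_demo = rng_demo.normal(0.0, 1.0, size=12)
y_demo = rng_demo.normal(0.0, 1.5, size=15)
f_obs_demo = x_demo.var(ddof=1) / y_demo.var(ddof=1)
print(f'F_obs = {f_obs_demo:.3f}, df1 = {len(x_demo) - 1}, df2 = {len(y_demo) - 1}')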
Interpreting F_obs#
If F_obs ≈ 1, the sample variances are similar.
If F_obs is large, group 1 looks more variable than group 2.
If F_obs is small (≪ 1), group 1 looks less variable.
Two-sided decision rule (α = 0.05)#
Reject H₀ if F_obs lands in either tail:

$$F_{\text{obs}} < F_{\alpha/2,\,df_1,df_2} \quad\text{or}\quad F_{\text{obs}} > F_{1-\alpha/2,\,df_1,df_2},$$

where $F_{q,\,df_1,df_2}$ denotes the $q$-quantile of the F distribution. In practice, the two-sided p-value is

$$p = \min\bigl(1,\; 2\min\{\,P(F \le F_{\text{obs}}),\; P(F \ge F_{\text{obs}})\,\}\bigr).$$
We’ll compute these probabilities with a Monte Carlo approximation (NumPy only).
def f_test_variances(
x: np.ndarray,
y: np.ndarray,
*,
alternative: str = 'two-sided',
alpha: float = 0.05,
n_sim: int = 300_000,
seed: int = 0,
) -> dict:
"""Two-sample F-test for equal variances using a Monte Carlo reference distribution.
Parameters
----------
x, y:
1D samples.
alternative:
'two-sided', 'greater' (σ₁² > σ₂²), or 'less' (σ₁² < σ₂²).
alpha:
Significance level.
n_sim:
Number of Monte Carlo draws from F(df1, df2).
seed:
RNG seed (for reproducibility).
"""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
if x.ndim != 1 or y.ndim != 1:
raise ValueError('x and y must be 1D arrays.')
if len(x) < 2 or len(y) < 2:
raise ValueError('Need at least 2 observations per group.')
if not (0.0 < alpha < 1.0):
raise ValueError('alpha must be in (0, 1).')
n1, n2 = len(x), len(y)
df1, df2 = n1 - 1, n2 - 1
s1_sq = x.var(ddof=1)
s2_sq = y.var(ddof=1)
f_obs = s1_sq / s2_sq
rng_local = np.random.default_rng(seed)
f_sim = f_rvs(df1, df2, size=n_sim, rng=rng_local)
p_right = np.mean(f_sim >= f_obs)
p_left = np.mean(f_sim <= f_obs)
alternative_norm = alternative.strip().lower()
if alternative_norm in {'two-sided', 'two_sided', 'two sided'}:
alternative_canonical = 'two-sided'
p_value = 2.0 * min(p_left, p_right)
p_value = float(min(p_value, 1.0))
q_low, q_high = np.quantile(f_sim, [alpha / 2.0, 1.0 - alpha / 2.0])
critical_values = (float(q_low), float(q_high))
reject = bool((f_obs < q_low) or (f_obs > q_high))
# CI for variance ratio σ₁² / σ₂² (Monte Carlo quantiles)
ci_low = float(f_obs / q_high)
ci_high = float(f_obs / q_low)
ci = (ci_low, ci_high)
elif alternative_norm in {'greater', 'right', 'larger'}:
alternative_canonical = 'greater'
p_value = float(p_right)
q_high = float(np.quantile(f_sim, 1.0 - alpha))
critical_values = (q_high,)
reject = bool(f_obs > q_high)
ci = None
elif alternative_norm in {'less', 'left', 'smaller'}:
alternative_canonical = 'less'
p_value = float(p_left)
q_low = float(np.quantile(f_sim, alpha))
critical_values = (q_low,)
reject = bool(f_obs < q_low)
ci = None
else:
raise ValueError("alternative must be one of: 'two-sided', 'greater', 'less'.")
# Monte Carlo standard error for the estimated p-value (rough sense of noise)
mc_se = float(np.sqrt(p_value * (1.0 - p_value) / n_sim))
return {
'n1': n1,
'n2': n2,
'df1': df1,
'df2': df2,
's1_sq': float(s1_sq),
's2_sq': float(s2_sq),
'f_obs': float(f_obs),
'alternative': alternative_canonical,
'alpha': float(alpha),
'p_value': float(p_value),
'mc_se': mc_se,
'critical_values': critical_values,
'reject_h0': reject,
'ci_var_ratio': ci,
'n_sim': int(n_sim),
'seed': int(seed),
}
def plot_f_test(result: dict, *, x_max_quantile: float = 0.997) -> go.Figure:
df1 = int(result['df1'])
df2 = int(result['df2'])
f_obs = float(result['f_obs'])
alpha = float(result['alpha'])
alternative = result['alternative']
rng_plot = np.random.default_rng(123)
f_sim_plot = f_rvs(df1, df2, size=200_000, rng=rng_plot)
x_max = float(np.quantile(f_sim_plot, x_max_quantile))
x_max = max(x_max, f_obs) * 1.1
x_grid = np.linspace(1e-4, x_max, 900)
pdf = f_pdf(x_grid, df1, df2)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_grid, y=pdf, mode='lines', name=f'F pdf (df1={df1}, df2={df2})'))
# Shade rejection region(s)
if alternative == 'two-sided':
q_low, q_high = result['critical_values']
left = x_grid <= q_low
right = x_grid >= q_high
fig.add_trace(
go.Scatter(
x=x_grid[left],
y=pdf[left],
mode='lines',
fill='tozeroy',
name=f'Reject (α/2={alpha/2:.3f})',
line=dict(color='rgba(239, 85, 59, 1.0)'),
fillcolor='rgba(239, 85, 59, 0.25)',
)
)
fig.add_trace(
go.Scatter(
x=x_grid[right],
y=pdf[right],
mode='lines',
fill='tozeroy',
name=f'Reject (α/2={alpha/2:.3f})',
line=dict(color='rgba(239, 85, 59, 1.0)'),
fillcolor='rgba(239, 85, 59, 0.25)',
showlegend=False,
)
)
fig.add_vline(x=q_low, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
fig.add_vline(x=q_high, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
elif alternative == 'greater':
(q_high,) = result['critical_values']
right = x_grid >= q_high
fig.add_trace(
go.Scatter(
x=x_grid[right],
y=pdf[right],
mode='lines',
fill='tozeroy',
name=f'Reject (α={alpha:.3f})',
line=dict(color='rgba(239, 85, 59, 1.0)'),
fillcolor='rgba(239, 85, 59, 0.25)',
)
)
fig.add_vline(x=q_high, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
elif alternative == 'less':
(q_low,) = result['critical_values']
left = x_grid <= q_low
fig.add_trace(
go.Scatter(
x=x_grid[left],
y=pdf[left],
mode='lines',
fill='tozeroy',
name=f'Reject (α={alpha:.3f})',
line=dict(color='rgba(239, 85, 59, 1.0)'),
fillcolor='rgba(239, 85, 59, 0.25)',
)
)
fig.add_vline(x=q_low, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
else:
raise ValueError('Unknown alternative in result dict.')
fig.add_vline(x=f_obs, line_dash='dash', line_color='black')
fig.add_annotation(
x=f_obs,
y=float(np.interp(f_obs, x_grid, pdf)),
text=f"F_obs={f_obs:.3f}<br>p≈{result['p_value']:.4f}",
showarrow=True,
arrowhead=2,
ax=40,
ay=-40,
bgcolor='rgba(255,255,255,0.9)',
)
fig.update_layout(
title='F-test: reference distribution and rejection region',
xaxis_title='F',
yaxis_title='density',
legend_title='Region',
)
return fig
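This notebook deliberately stays NumPy-only, but if SciPy happens to be installed, the Monte Carlo machinery is easy to cross-check against the exact F distribution. An optional sketch (the demo data, sizes, and seeds are arbitrary choices, not part of the analysis):

# Optional cross-check against SciPy's exact F cdf (skipped if SciPy is absent).
# Only the agreement between the Monte Carlo and exact p-values matters here.
try:
    from scipy import stats

    rng_demo = np.random.default_rng(4)
    res = f_test_variances(rng_demo.normal(size=15), rng_demo.normal(size=20), seed=3)
    p_exact = min(
        1.0,
        2.0 * min(
            stats.f.cdf(res['f_obs'], res['df1'], res['df2']),
            stats.f.sf(res['f_obs'], res['df1'], res['df2']),
        ),
    )
    print(f"Monte Carlo p ≈ {res['p_value']:.4f} (±{res['mc_se']:.4f}), exact p = {p_exact:.4f}")
except ImportError:
    print('SciPy not installed; skipping the cross-check.')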
5) Worked example: equal variances (H₀ true)#
We will sample two Normal groups with the same standard deviation (σ₁ = σ₂). In this situation, we expect the test to reject only about α of the time.
For a single run, the p-value can land anywhere from 0 to 1 (randomness!), but on average it should not look systematically small.
n1, n2 = 18, 22
x = rng.normal(loc=0.0, scale=1.2, size=n1)
y = rng.normal(loc=0.0, scale=1.2, size=n2)
res_equal = f_test_variances(x, y, alternative='two-sided', alpha=0.05, n_sim=400_000, seed=1)
res_equal
{'n1': 18,
'n2': 22,
'df1': 17,
'df2': 21,
's1_sq': 1.781384820492569,
's2_sq': 1.6656962640243989,
'f_obs': 1.0694535726391445,
'alternative': 'two-sided',
'alpha': 0.05,
'p_value': 0.87165,
'mc_se': 0.0005288579145195805,
'critical_values': (0.3850427399035428, 2.4863470025297367),
'reject_h0': False,
'ci_var_ratio': (0.43013045707257586, 2.777493150259251),
'n_sim': 400000,
'seed': 1}
plot_f_test(res_equal).show()
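The single run above is just one draw. To back the claim that p-values under H₀ are not systematically small, here is a quick calibration sketch (helper names like rng_check and f_ref are ours): it replays the same two-group setup a couple of thousand times, reusing one sorted Monte Carlo reference so the p-values stay cheap to compute.

# Under H0 the two-sided p-value should be roughly uniform on [0, 1].
rng_check = np.random.default_rng(10)
f_ref = np.sort(f_rvs(n1 - 1, n2 - 1, size=200_000, rng=rng_check))
reps = 2_000
f_reps = (
    rng_check.normal(0.0, 1.2, size=(reps, n1)).var(axis=1, ddof=1)
    / rng_check.normal(0.0, 1.2, size=(reps, n2)).var(axis=1, ddof=1)
)
# Empirical left/right tail probabilities via binary search in the sorted reference
p_left = np.searchsorted(f_ref, f_reps, side='right') / f_ref.size
p_right = 1.0 - np.searchsorted(f_ref, f_reps, side='left') / f_ref.size
p_two = np.minimum(1.0, 2.0 * np.minimum(p_left, p_right))
print('fraction of p-values below 0.05:', (p_two < 0.05).mean())  # expect ≈ 0.05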
6) Worked example: different variances (H₁ true)#
Now we generate two groups where the second group is noisier (larger σ). This should push F_obs away from 1 and often produce a small p-value.
n1, n2 = 18, 22
x = rng.normal(loc=0.0, scale=1.0, size=n1)
y = rng.normal(loc=0.0, scale=2.0, size=n2)
res_diff = f_test_variances(x, y, alternative='two-sided', alpha=0.05, n_sim=400_000, seed=2)
res_diff
{'n1': 18,
'n2': 22,
'df1': 17,
'df2': 21,
's1_sq': 0.7968991166284403,
's2_sq': 4.228029458951296,
'f_obs': 0.18848002937664016,
'alternative': 'two-sided',
'alpha': 0.05,
'p_value': 0.00098,
'mc_se': 4.947321497537834e-05,
'critical_values': (0.38428465507232556, 2.4922479853766086),
'reject_h0': True,
'ci_var_ratio': (0.07562651489039465, 0.490469829822288),
'n_sim': 400000,
'seed': 2}
plot_f_test(res_diff).show()
7) Interpreting the output (what exactly it means)#
From the result dictionary:
f_obs = s1_sq / s2_sq is the observed variance ratio.

p_value is the probability (under H₀) of seeing an F statistic at least as extreme as the observed one.

reject_h0 tells you whether the observed statistic lands in the rejection region for your chosen α.

ci_var_ratio is a (Monte Carlo) confidence interval for σ₁² / σ₂².
A practical reading#
If p_value < α: the data are inconsistent with equal variances (under the Normality assumption).

If p_value ≥ α: you do not have enough evidence to claim the variances differ.
A helpful companion to the p-value is the confidence interval:
If the CI for σ₁² / σ₂² includes 1, equal variances are plausible.

If the CI is entirely above 1, group 1's variance is likely larger.

If the CI is entirely below 1, group 1's variance is likely smaller.
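As a concrete check, the two worked examples above already carry this information:

# Does each variance-ratio CI contain 1? (reads the result dicts from sections 5 and 6)
for name, res in [('equal-variance example', res_equal), ('different-variance example', res_diff)]:
    lo, hi = res['ci_var_ratio']
    print(f"{name}: CI for σ₁²/σ₂² = ({lo:.3f}, {hi:.3f}), contains 1: {lo < 1.0 < hi}")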
8) Pitfall demo: heavy tails can break the test#
Even when the true variances are equal, non-normal data can cause the F-test to reject too often.
Below we compare the empirical false positive rate (Type I error) under:
Normal data (assumptions match)
Heavy-tailed data (Student-t with df=3, scaled to variance 1)
alpha = 0.05
n1, n2 = 20, 20
df1, df2 = n1 - 1, n2 - 1
# Critical region under the F(df1, df2) model (computed once)
rng_crit = np.random.default_rng(0)
f_crit_sim = f_rvs(df1, df2, size=600_000, rng=rng_crit)
q_low, q_high = np.quantile(f_crit_sim, [alpha / 2.0, 1.0 - alpha / 2.0])
n_rep = 8_000
# Case A: Normal (variance 1)
x_norm = rng.normal(0.0, 1.0, size=(n_rep, n1))
y_norm = rng.normal(0.0, 1.0, size=(n_rep, n2))
f_norm = x_norm.var(axis=1, ddof=1) / y_norm.var(axis=1, ddof=1)
reject_norm = np.mean((f_norm < q_low) | (f_norm > q_high))
# Case B: Heavy-tailed t(df=3), scaled to variance 1
# Var(t_df) = df/(df-2) for df>2 -> 3/(1)=3, so divide by sqrt(3) to get variance ~1
x_t = rng.standard_t(df=3, size=(n_rep, n1)) / np.sqrt(3.0)
y_t = rng.standard_t(df=3, size=(n_rep, n2)) / np.sqrt(3.0)
f_t = x_t.var(axis=1, ddof=1) / y_t.var(axis=1, ddof=1)
reject_t = np.mean((f_t < q_low) | (f_t > q_high))
fig = px.bar(
x=['Normal', 'Student-t df=3 (heavy-tailed)'],
y=[reject_norm, reject_t],
title='Type I error inflation when Normality is violated',
labels={'x': 'data generating distribution (true variances equal)', 'y': f'fraction rejected (α={alpha})'},
)
fig.add_hline(y=alpha, line_dash='dash', line_color='black')
fig.update_layout(yaxis_range=[0, max(alpha * 2.5, reject_t * 1.15)])
fig.show()
reject_norm, reject_t
(0.047375, 0.292375)
Takeaway#
If your data are plausibly non-normal or contain outliers, the classic F-test can be misleading.
In that case, look at robust variance tests (Levene / Brown–Forsythe / Fligner–Killeen) and always pair any test with plots (box/violin, histograms, QQ-plots).
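For two groups, one of those robust alternatives is easy to sketch. Below is a minimal NumPy-only Brown–Forsythe statistic (the function and helper names are ours, not from a library): replace each observation by its absolute deviation from its group median, then form the usual one-way ANOVA F ratio on those deviations, which is approximately F(1, n₁ + n₂ − 2) under H₀. Applied to the heavy-tailed replicates from the cell above, its rejection rate should sit much closer to α than the 0.29 we just saw.

def brown_forsythe_stat(x, y):
    """Brown–Forsythe statistic for two groups (vectorized over rows)."""
    zx = np.abs(x - np.median(x, axis=-1, keepdims=True))
    zy = np.abs(y - np.median(y, axis=-1, keepdims=True))
    m1, m2 = zx.shape[-1], zy.shape[-1]
    N = m1 + m2
    zbar = (zx.sum(axis=-1) + zy.sum(axis=-1)) / N
    between = m1 * (zx.mean(axis=-1) - zbar) ** 2 + m2 * (zy.mean(axis=-1) - zbar) ** 2
    within = (
        ((zx - zx.mean(axis=-1, keepdims=True)) ** 2).sum(axis=-1)
        + ((zy - zy.mean(axis=-1, keepdims=True)) ** 2).sum(axis=-1)
    )
    return (N - 2) * between / within  # ~ F(1, N - 2) under H0

# Critical value for F(1, n1 + n2 - 2) from the chi-square ratio definition
rng_bf = np.random.default_rng(7)
f_ref_bf = f_rvs(1, n1 + n2 - 2, size=400_000, rng=rng_bf)
q_bf = np.quantile(f_ref_bf, 1.0 - alpha)
print('Brown–Forsythe rejection rate on the heavy-tailed replicates:',
      np.mean(brown_forsythe_stat(x_t, y_t) > q_bf))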
9) Connection to ANOVA / regression (why F shows up everywhere)#
In ANOVA and linear regression, an F-statistic typically compares two variance estimates:
a variance explained by a model/effect (signal)
a residual variance not explained (noise)
That ratio also follows an F distribution under suitable assumptions. So even though this notebook focuses on the variance-ratio test, the same distribution is doing a lot of work across classical statistics.
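As a small illustration of that ratio, here is a minimal one-way ANOVA F statistic in NumPy (a sketch on simulated data; rng_anova and the group means are arbitrary choices, not from the analysis above):

# Between-group variance over within-group variance: ~ F(k - 1, N - k)
# under the null of equal group means.
rng_anova = np.random.default_rng(11)
groups = [rng_anova.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.3, 0.6)]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_anova = (ss_between / (k - 1)) / (ss_within / (N - k))
print(f'ANOVA F = {f_anova:.3f} with df1={k - 1}, df2={N - k}')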
Exercises#
Swap the groups (x and y) in the worked examples. What happens to f_obs? What happens to a two-sided p-value?

Increase sample sizes and observe how the F distribution tightens around 1 under H₀.

Create data with a few extreme outliers and see how quickly the test starts rejecting.
References#
F distribution (definition as ratio of chi-square variables)
Variance ratio confidence interval derived from F quantiles
Robust alternatives: Levene, Brown–Forsythe, Fligner–Killeen